# Multimodal Generation

**Blip Arabic Flickr 8k** · MIT · omarsabri8756 · 56 downloads · 1 like
An Arabic image-captioning model fine-tuned on the BLIP architecture and optimized for the Flickr8k Arabic dataset.
*Image-to-Text · Transformers · Supports Multiple Languages*

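A minimal usage sketch for this kind of BLIP captioner with `transformers`; the repo id below is an assumption based on the author and model name, so verify it on the model page.

```python
# Caption an image with a BLIP checkpoint (repo id assumed; verify on the Hub).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "omarsabri8756/blip-Arabic-flickr-8k"  # assumed repo id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))  # Arabic caption
```
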
**GLM 4 32B 0414 GGUF** · MIT · unsloth · 4,680 downloads · 10 likes
GLM-4-32B-0414 is a 32-billion-parameter large language model comparable in performance to GPT-4o and DeepSeek-V3. It supports both Chinese and English and excels at code generation, function calling, and complex task processing.
*Large Language Model · Supports Multiple Languages*

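GGUF builds like this one target llama.cpp-family runtimes rather than `transformers`. A minimal sketch with `llama-cpp-python`, assuming the repo id matches the listing and picking one quantization file by glob:

```python
# Run a GGUF quant with llama-cpp-python; repo id and quant level are assumptions.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/GLM-4-32B-0414-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                # glob picks one quant; choose per your RAM
    n_ctx=8192,
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python quicksort."}]
)
print(resp["choices"][0]["message"]["content"])
```
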
**Instancecap Captioner** · Other · AnonMegumi · 14 downloads · 1 like
A vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct on the InstanceVid dataset, specializing in instance-level image caption generation.
*Image-to-Text · Transformers*

**GLM 4 32B 0414** · MIT · THUDM · 10.91k downloads · 320 likes
GLM-4-32B-0414 is a 32-billion-parameter large language model comparable in performance to the GPT series. It supports both Chinese and English and excels at code generation, function calling, and complex task processing.
*Large Language Model · Transformers · Supports Multiple Languages*

**Llama 3.2 Vision Instruct Bpmncoder** · Apache-2.0 · utkarshkingh · 40 downloads · 1 like
A Llama 3.2 11B Vision instruction-tuned model optimized with Unsloth, using 4-bit quantization for roughly 2x faster training.
*Text-to-Image · Transformers · English*

**Vit Gpt2 Image Captioning** · Apache-2.0 · aryan083 · 31 downloads · 0 likes
An image-captioning model based on the ViT and GPT2 architectures, capable of generating natural-language descriptions for input images.
*Image-to-Text*

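ViT+GPT2 captioners follow the standard `VisionEncoderDecoderModel` pattern in `transformers`. A minimal sketch, shown with the well-known reference checkpoint since this entry's exact repo id isn't listed:

```python
# Standard ViT encoder + GPT2 decoder captioning loop.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_id = "nlpconnect/vit-gpt2-image-captioning"  # reference checkpoint; swap in this entry's repo
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pixel_values = processor(
    images=Image.open("photo.jpg").convert("RGB"), return_tensors="pt"
).pixel_values
ids = model.generate(pixel_values, max_new_tokens=30)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```
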
**Cockatiel 13B** · Fr0zencr4nE · 26 downloads · 2 likes
A video-to-text generation model built on VILA-v1.5-13B, capable of producing fine-grained, human-preference-aligned descriptive text for input videos.
*Video-to-Text · Transformers*

**Llama 3.2 11B Vision Invoices Mini** · Apache-2.0 · atulSethi · 46 downloads · 1 like
A multimodal large language model fine-tuned from unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit, supporting visual instruction understanding tasks; Unsloth optimization roughly doubles training speed.
*Text-to-Image · Transformers · English*

**Qwenfluxprompt** · Apache-2.0 · mam33 · 25 downloads · 0 likes
A LoRA trained for the Wan2.1 14B video generation model, suitable for text-to-video and image-to-video tasks.
*Video Processing · Supports Multiple Languages*

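A sketch of how such a LoRA would be attached to the diffusers Wan2.1 text-to-video pipeline; the pipeline class is real, but the LoRA repo id and its compatibility with this loader are assumptions to verify against the model card:

```python
# Attach a Wan2.1 LoRA to the diffusers pipeline (LoRA repo id assumed).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("mam33/qwenfluxprompt")  # assumed repo id

frames = pipe(prompt="a red panda surfing a small wave", num_frames=33).frames[0]
export_to_video(frames, "out.mp4", fps=16)
```
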
**Liquid V1 7B** · MIT · Junfeng5 · 11.35k downloads · 84 likes
Liquid is an autoregressive generation paradigm that fuses visual understanding and generation by tokenizing images into discrete codes and learning those code embeddings alongside text tokens in a shared feature space.
*Text-to-Image · Transformers · English*

**Molmo 7B D 0924 NF4** · Apache-2.0 · Scoolar · 1,259 downloads · 1 like
A 4-bit quantized version of Molmo-7B-D-0924 that reduces VRAM usage via the NF4 quantization strategy, suited to VRAM-constrained environments.
*Image-to-Text · Transformers*

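This repo ships already-quantized weights; the NF4 strategy it names corresponds to the standard bitsandbytes 4-bit config, sketched here as the equivalent load applied to the base Molmo checkpoint:

```python
# NF4 4-bit loading with bitsandbytes, applied to the base Molmo checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 storage
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924",
    quantization_config=bnb,
    trust_remote_code=True,  # Molmo uses custom modeling code
    device_map="auto",
)
```
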
**Mini Image Captioning** · Apache-2.0 · cnmoro · 292 downloads · 3 likes
A lightweight image-captioning model based on bert-mini and vit-small, weighing only about 130 MB and extremely fast on CPU.
*Image-to-Text · Transformers · English*

**Janus Pro 1B ONNX** · MIT · onnx-community · 3,010 downloads · 47 likes
Janus-Pro-1B is a multimodal causal language model supporting tasks such as text-to-image and image-to-text.
*Text-to-Image · Transformers*

**Longva 7B TPO** · MIT · ruili0 · 225 downloads · 1 like
LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling at long-video understanding tasks.
*Video-to-Text · Transformers*

**Hunyuanvideo HFIE** · Other · jbilcke-hf · 21 downloads · 1 like
Tencent's Hunyuan Video is a text-to-video generation model, packaged here for compatibility with Hugging Face Inference Endpoints.
*Text-to-Video · English*

**Instructcir Llava Phi35 Clip224 Lp** · Apache-2.0 · uta-smile · 15 downloads · 2 likes
InstructCIR is an instruction-aware, contrastive-learning-based compositional image retrieval model using the ViT-L-224 and Phi-3.5-Mini architectures, focusing on image-text-to-text generation tasks.
*Image-to-Text*

**Captain Eris Violet V0.420 12B** · Other · Nitral-AI · 445.12k downloads · 41 likes
A 12B-parameter merged model created with the mergekit tool by combining Epiculous/Violet_Twilight-v0.2 and Nitral-AI/Captain_BMO-12B, supporting text generation tasks.
*Large Language Model · Transformers · English*

**Cogvideox 2B LiFT** · MIT · Fudan-FUXI · 21 downloads · 1 like
CogVideoX-2B-LiFT is a text-to-video generation model fine-tuned from CogVideoX-1.5 using reward-weighted learning.
*Text-to-Video · English*

**Llama 3.2 11B Vision Radiology Mini** · Apache-2.0 · mervinpraison · 39 downloads · 2 likes
A vision instruction-tuned model optimized with Unsloth, supporting multimodal task processing.
*Text-to-Image · Transformers · English*

**Thaicapgen Clip Gpt2** · Natthaphon · 18 downloads · 0 likes
An encoder-decoder model pairing a CLIP encoder with a GPT2 decoder to generate Thai image descriptions.
*Image-to-Text · Other*

**Janus 1.3B ONNX** · Other · onnx-community · 123 downloads · 15 likes
Janus-1.3B is a multimodal causal language model supporting text-to-image, image-to-text, and image-text-to-text tasks.
*Text-to-Image · Transformers*

**Omnigen V1** · MIT · Shitao · 5,886 downloads · 309 likes
OmniGen is a unified image generation model that supports multiple image generation tasks.
*Image Generation*

**Emu3 Stage1** · Apache-2.0 · BAAI · 1,359 downloads · 26 likes
Emu3 is a multimodal model developed by the Beijing Academy of Artificial Intelligence, trained solely with next-token prediction and supporting image, text, and video processing.
*Text-to-Image · Transformers*

**Sd15.ip Adapter.plus** · Apache-2.0 · refiners · 112 downloads · 0 likes
An IP-Adapter-based image-to-image adapter for the Stable Diffusion 1.5 model, supporting artistic image generation driven by image prompts.
*Image Generation · Other*

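This entry packages the adapter for the Refiners framework; in diffusers the same IP-Adapter Plus technique looks like the sketch below, using the reference h94/IP-Adapter weights rather than this repo:

```python
# IP-Adapter Plus on SD 1.5 via diffusers (reference weights, not the refiners repo).
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin"
)
pipe.set_ip_adapter_scale(0.7)  # how strongly the image prompt steers generation

style = load_image("style_ref.png")
image = pipe(prompt="a cat, painterly style", ip_adapter_image=style).images[0]
image.save("cat.png")
```
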
**Aim Xlarge** · MIT · hp-l33 · 23 downloads · 5 likes
AiM is an unconditional image generation model built on PyTorch, integrated with and pushed to the Hugging Face Hub via PytorchModelHubMixin.
*Image Generation*

**Cogflorence 2.2 Large** · MIT · thwri · 20.64k downloads · 33 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B, suited to image-to-text tasks.
*Image-to-Text · Transformers · Supports Multiple Languages*

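Florence-2 fine-tunes keep the base model's task-prompt interface. A minimal captioning sketch, assuming the repo id matches the listing:

```python
# Florence-2-style captioning with a task prompt (repo id assumed).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "thwri/CogFlorence-2.2-Large"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task = "<MORE_DETAILED_CAPTION>"
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text=task, images=image, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=256, num_beams=3,
)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```
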
**Lumina Mgpt 7B 1024** · Alpha-VLLM · 27 downloads · 9 likes
Lumina-mGPT is a family of multimodal autoregressive models that excel at generating flexible, realistic images from text descriptions and can perform various vision and language tasks.
*Text-to-Image*

**Lumina Mgpt 7B 768** · Alpha-VLLM · 1,944 downloads · 33 likes
Lumina-mGPT is a family of multimodal autoregressive models that excel at generating flexible, realistic images from text descriptions and can perform various vision and language tasks.
*Text-to-Image · Transformers*

**Lumina Mgpt 7B 768 Omni** · Alpha-VLLM · 264 downloads · 7 likes
Lumina-mGPT is a series of multimodal autoregressive models that excel at generating flexible, realistic images from text descriptions.
*Text-to-Image · Transformers*

**Cogflorence 2.1 Large** · MIT · thwri · 2,541 downloads · 22 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B, focusing on image-to-text tasks.
*Image-to-Text · Transformers · Supports Multiple Languages*

**Latte 1** · Apache-2.0 · maxin-cn · 1,027 downloads · 19 likes
Latte is a Transformer-based latent diffusion model focused on text-to-video generation, with pre-trained weights available for multiple datasets.
*Text-to-Video*

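Latte has a native pipeline in recent diffusers releases. A minimal sketch:

```python
# Text-to-video with diffusers' LattePipeline.
import torch
from diffusers import LattePipeline
from diffusers.utils import export_to_gif

pipe = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")
frames = pipe("a dog wearing sunglasses on a beach").frames[0]
export_to_gif(frames, "latte.gif")
```
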
**Shotluck Holmes 1.5** · Apache-2.0 · RichardLuo · 158 downloads · 3 likes
Shot2Story-20K is an image-to-text generation model capable of converting input images into coherent textual descriptions or stories.
*Image-to-Text · Transformers · English*

**Vit Base Patch16 224 Turkish Gpt2** · Apache-2.0 · atasoglu · 20 downloads · 2 likes
A vision encoder-decoder model based on ViT and a Turkish GPT2, used to generate Turkish image descriptions.
*Image-to-Text · Transformers · Other*

**VLM WebSight Finetuned** · Apache-2.0 · HuggingFaceM4 · 611 downloads · 184 likes
Converts screenshots of website components into HTML/CSS code; developed from an early checkpoint of a vision-language foundation model.
*Image-to-Text · Transformers · Supports Multiple Languages*

**I2vgen Xl** · MIT · ali-vilab · 4,252 downloads · 172 likes
An open-source video synthesis codebase developed by Alibaba's Tongyi Lab, integrating multiple advanced video generation models.
*Text-to-Video*

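The I2VGen-XL image-to-video model from this codebase is available through diffusers. A minimal sketch following the documented pipeline:

```python
# Image-to-video with diffusers' I2VGenXLPipeline.
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import load_image, export_to_video

pipe = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = load_image("first_frame.jpg")
frames = pipe(
    prompt="a sailboat drifting at sunset", image=image, num_inference_steps=50
).frames[0]
export_to_video(frames, "i2v.mp4", fps=8)
```
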
**Sharecaptioner** · Lin-Chen · 401 downloads · 56 likes
ShareCaptioner is an open-source image caption generation model based on an improved InternLM-Xcomposer-7B base model and fine-tuned on the GPT4-Vision-assisted ShareGPT4V dataset; it generates high-quality image descriptions.
*Image-to-Text · Transformers*

**Textdiffuser2 Layout Planner** · MIT · JingyeChen22 · 337 downloads · 5 likes
TextDiffuser-2 is a text-to-image generation model focused on text rendering, leveraging the capabilities of language models to generate images containing text.
*Text-to-Image*

**Text To Image** · sairajg · 848 downloads · 16 likes
Generates high-quality images from input text descriptions, suitable for scenarios such as creative design and content creation.
*Text-to-Image*

**Image Caption Using ViT GPT2** · Apache-2.0 · Ayansk11 · 15 downloads · 1 like
An image-captioning model based on the Vision Transformer (ViT) and GPT2 architectures, capable of generating natural-language descriptions for input images.
*Image-to-Text · Transformers*

**Biomedgpt LM 7B** · Apache-2.0 · PharMolix · 485 downloads · 72 likes
BioMedGPT-LM-7B is the first large-scale generative language model for the biomedical domain based on Llama2, specializing in biomedical text generation and question answering.
*Large Language Model · Transformers*

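As a Llama2-based LM, it loads with the stock causal-LM API; the Q/A prompt format below is an assumption, so check the model card:

```python
# Standard causal-LM generation; the Question/Answer prompt format is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PharMolix/BioMedGPT-LM-7B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Question: What is the mechanism of action of metformin?\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```
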